S. Sinha et al.^{1} proposed Likelihood-free Importance Weights (LFIW) to compensate for the distribution shift between transitions drawn from the current policy and those in the replay buffer (collected by many past behavior policies).
The concept is similar to V-trace^{2} and Retrace^{3}; however, LFIW requires no episodic treatment and estimates the probability ratio \(d^{\pi}/d^{\mathcal{D}}\) with an extra network \(w_{\psi}(s,a)\). First, the current distribution \(d^{\pi}\) is approximated by the distribution \(d^{\mathcal{D}_f}\) of a secondary small (fast) replay buffer.
The estimation of the probability ratio is based on the following lemma:
\[ D(P\|Q) \geq \mathbb{E}_P[f^{\prime}(w(x))] - \mathbb{E}_Q[f^{\ast}(f^{\prime}(w(x)))] \text{,}\]
where \(D(\cdot\|\cdot)\) is the f-divergence induced by a convex lower-semicontinuous function \( f \), \(f^{\prime}\) is its first derivative, and \(f^{\ast}\) is its convex conjugate. Equality is achieved when \(w=dP/dQ\).
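As a concrete instantiation (ours, for illustration; the paper treats general \(f\)): taking \(f(x)=x\log x\), which induces the KL divergence, gives \(f^{\prime}(x)=\log x+1\) and \(f^{\ast}(y)=e^{y-1}\), so the bound reads
\[ D_{KL}(P\|Q) \geq \mathbb{E}_P[\log w(x)] + 1 - \mathbb{E}_Q[w(x)] \text{.}\]
At \(w=dP/dQ\) we have \(\mathbb{E}_Q[w]=1\), and the right-hand side equals \(\mathbb{E}_P[\log(dP/dQ)]=D_{KL}(P\|Q)\), recovering equality.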
By minimizing \(L_{w}(\psi)=\mathbb{E}_{\mathcal{D}}[f^{\ast}(f^{\prime}(w_{\psi}(s,a)))]-\mathbb{E}_{\mathcal{D}_f}[f^{\prime}(w_{\psi}(s,a))]\), the network \(w_{\psi}(s,a)\) approaches \(d^{\mathcal{D}_f}/d^{\mathcal{D}}\).
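For instance, with the KL choice \(f(x)=x\log x\) (so \(f^{\prime}(x)=\log x+1\) and \(f^{\ast}(y)=e^{y-1}\)), the loss reduces to \(\mathbb{E}_{\mathcal{D}}[w_{\psi}]-\mathbb{E}_{\mathcal{D}_f}[\log w_{\psi}]\) up to an additive constant. A minimal numpy sketch under that choice, using a linear model and synthetic 1-D Gaussian data in place of a real network and real buffers (everything here is illustrative, not the paper's setup):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins: "slow" buffer samples ~ Q = N(0,1),
# "fast" buffer samples ~ P = N(1,1); true ratio dP/dQ = exp(x - 0.5).
x_slow = rng.normal(0.0, 1.0, size=5000)   # plays the role of D
x_fast = rng.normal(1.0, 1.0, size=5000)   # plays the role of D_f

def features(x):
    # Simple linear features; the real w_psi(s, a) is a neural network.
    return np.stack([x, np.ones_like(x)], axis=1)

psi = np.zeros(2)   # parameters of log w_psi = psi @ features
lr = 0.05

for _ in range(500):
    # KL case: L_w = E_D[w_psi] - E_{D_f}[log w_psi] + const,
    # so grad_psi L_w = E_D[w_psi * phi] - E_{D_f}[phi].
    w_slow = np.exp(features(x_slow) @ psi)
    grad = (w_slow[:, None] * features(x_slow)).mean(axis=0) \
           - features(x_fast).mean(axis=0)
    psi -= lr * grad

# psi should approach [1.0, -0.5], since log(dP/dQ) = x - 0.5 here.
```

The objective is convex in \(\psi\) for this linear parametrization, so plain gradient descent suffices; with a neural network one would use the same loss with a stochastic optimizer.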
Additionally, to stabilize training, \(w_{\psi}(s,a)\) is self-normalized over the main (slow) replay buffer with a temperature hyperparameter \(T\):
\[\tilde{w}_{\psi}(s,a) = \frac{w_{\psi}(s,a)^{1/T}}{\mathbb{E}_{\mathcal{D}}[w_{\psi}(s,a)^{1/T}]}\]
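In practice the expectation in the denominator is estimated over a sampled batch from the slow buffer. A small sketch (the function name and batch are ours):

```python
import numpy as np

def normalized_weights(w, T=1.0):
    # tilde{w} = w^{1/T} / E_D[w^{1/T}], with the expectation replaced
    # by the mean over the batch drawn from the slow buffer.
    w_pow = np.power(w, 1.0 / T)
    return w_pow / w_pow.mean()

w = np.array([0.5, 1.0, 2.0, 4.0])   # hypothetical w_psi outputs for a batch
print(normalized_weights(w, T=1.0))  # mean is 1 by construction
print(normalized_weights(w, T=4.0))  # larger T flattens the weights
```

Raising \(T\) pushes all \(\tilde{w}_{\psi}\) toward 1, interpolating between full importance weighting (\(T=1\)) and uniform weighting (\(T\to\infty\)).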
Unfortunately, this paper was rejected at ICLR 2021. We hope an improved version will appear soon.
You can implement LFIW with cpprb. Two ReplayBuffer instances with different sizes are required.
S. Sinha et al., “Experience Replay with Likelihood-free Importance Weights”, (2020) arXiv:2006.13169 ↩︎
L. Espeholt et al., “IMPALA: Scalable Distributed Deep-RL with Importance Weighted Actor-Learner Architectures”, ICML (2018), PMLR 80:1407-1416 (arXiv:1802.01561, code) ↩︎
R. Munos et al., “Safe and Efficient Off-Policy Reinforcement Learning”, NeurIPS (2016) (arXiv:1606.02647) ↩︎